Method and system for document presentation and analysis

ABSTRACT

A document analysis and search system may include a program module storable on a client device positioned in communication with a network which, in turn, is in communication with a document provider database and a thesaurus database. The program module may include instructions executable by a processor of the client device to locate at least one document from among the plurality of documents. The program module may include an interface module and a document analysis module. The interface module may receive concept data relating to the subject matter of the search and a plurality of documents relating to the concept data from the document provider database. The interface module may generate and display a document analysis graphical user interface.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/501,370 filed on Apr. 11, 2012 by the inventors of the present application and titled Method and System for Document Presentation and Analysis which, in turn, claimed the benefit under 35 U.S.C. §371 of International Application No. PCT/US10/52321, the entire contents of each of which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to the field of document analysis, and more particularly to methods and systems for rapidly determining relevancy of one or more documents.

BACKGROUND

Document research involves identifying relevant subject matter or concepts within a document or set of documents. Search engines, for example, use “key” words or phrases as search arguments to locate text passages containing those words or phrases. The passages may or may not be relevant, however, regardless of the instance of the argument. Finding relevant subject matter involves not just the instance of the word or phrase, but the context in which it is found. The preceding and succeeding words that surround a keyword in a passage influence the meaning or effect of its use.

Sometimes the search for context, as opposed to an instance of a keyword, can be narrowed by using additional descriptive terms. Boolean operators are used by almost all search engines to link words separated by the operators in some logic set. For example, the operator “AND” implies the set of all instances of word number one used in conjunction with word number two; the operator “OR”, by contrast, implies the set of all instances of word number one combined with the set of all instances of word number two. In mathematical language, the first set is an intersection set and the second, a union set.
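
The set language above can be illustrated with a minimal sketch. The toy index below, which maps each word to the passage numbers containing it, is purely hypothetical and is not part of any search engine described herein.

```python
# Hypothetical toy index: word -> set of passage numbers containing that word.
index = {
    "valve": {1, 2, 5},
    "spring": {2, 3, 5, 8},
}

# "valve AND spring": passages containing both words (intersection set).
both = index["valve"] & index["spring"]

# "valve OR spring": passages containing either word (union set).
either = index["valve"] | index["spring"]

print(sorted(both))    # [2, 5]
print(sorted(either))  # [1, 2, 3, 5, 8]
```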

Wildcards, indicated by some symbol like “*” or “$”, can be used to substitute for letters, prefixes or endings, thereby picking up the alternative forms in which a word might appear. Proximity indicators, such as “ADJ”, “NEAR”, “WITH” and “SAME”, are used together with Boolean operators to indicate how far apart two words may appear in a text passage. This gives the document researcher a means for assessing context. Two words used in the same sentence, or in the same paragraph, can indicate a contextual nexus.

In the current state-of-the-art, finding contextual meaning involves reading whole passages or entire documents where keywords are located. Since the quality of document research is defined in the negative as not missing any relevant passages in a field of inquiry, the researcher can ill afford to simply spot-read. Search engines can find the keywords, but it is the reading task that defines not only the quality but the time spent on a properly conducted search exercise. Any artifice which reduces reading time without compromising quality becomes highly desirable for productivity reasons.

U.S. Published Application No. 20050210042 to Goedken shows methods and apparatus to search and analyze prior art. Goedken shows the benefit of grouping conceptually related words to a single color, and then highlighting those words in the text of a patent document. Goedken also recognizes the benefit of counting elements for reporting purposes (see FIG. 14a). Goedken, however, does not show a system for rapidly displaying the text of a document alongside an indexed color coded chart for allowing quick navigation and quickly showing the user prevalence of various concepts inside of a document. These are important shortcomings because the patent researcher requires a system for acquiring an initial understanding of a document in 1-2 seconds. The patent researcher must view thousands of documents in a typical search, and if the initial document inquiry takes more than a few seconds, then a patent search can become economically unfeasible.

U.S. Published Application No. 20060156222 to Chi shows a method for automatically performing conceptual highlighting in electronic text. Chi has also noticed that conceptually related words can be grouped together and highlighted the same color. However, Chi has not provided for additional features that enable rapid initial understanding of a document. For example, Chi doesn't teach methods of removing passages of no relevance to the reader's interest. In addition, Chi doesn't show methods of removing all but the most relevant passages. Moreover, Chi also doesn't show a method of providing rapid understanding (1-2 seconds) of a document, such that a researcher can make the quick decision of whether or not to start reading a document.

U.S. Pat. No. 7,194,693 to Cragun shows an apparatus and method for automatically highlighting text in an electronic document. However, highlighting is determined by user preferences and scroll speed. Cragun does not show features that allow rapid, staged understanding of a document that are required by the researcher wrestling with large numbers of long documents.

U.S. Pat. No. 6,823,331 to Abu-Hakima shows a concept identification system and method for use in reducing and/or representing text content of an electronic document. Although Abu-Hakima provides for counting and ranking, there are no tools for rapid understanding of the document once it is presented.

U.S. Published Application 20090276694 to Henry shows a System and Method for Document Display. Like the present invention, Henry has found the usefulness in presenting reference characters along with names on or near the figures to which they relate. However, Henry has not taught a search system where the reference characters are rapidly located for the searcher, and presented for quick navigation through the document. Moreover, Henry has decided to retrieve characters from drawings, where the present invention contains a method for hunting patent text for reference characters.

U.S. Published Application 20040113916 to Ungar shows a perceptual-based color selection for text highlighting. The text color choice is based upon factors such as the total amount of highlighted display.

Several problems still exist in the prior art. First, most search systems rely on a researcher to limit a document set using a combination of keyword and classification. But since a researcher is looking for multiple concepts simultaneously, limiting a search with a set of keywords will inevitably miss references showing the concepts that were not part of the immediate search. This is exacerbated when a searcher is looking for ten or more concepts simultaneously. Clearly, a better system would involve reviewing large sets of documents for all concepts simultaneously. However, the labor involved in reading large sets of long documents makes this approach impractical. Therefore, a system is required that enables rapid manual review of large sets of lengthy documents for multiple concepts simultaneously.

Embodiments of the present invention address many of the shortfalls in the prior art while presenting what will hereinafter become apparent to be a pioneering document analysis technology.

SUMMARY OF THE INVENTION

It is a first object of the present invention to enable location and loading of groups of words having relevance in a research project. It is a second object to provide an interface that enables rapid (1-2 second) first level of relevance determination through color coding of concept blocks. Yet another object of the present invention is to provide an interface that enables quick (5-10 second) second level of relevance determination through multi-colored highlighting of keywords. It is yet another object to provide multiple user options for removal of non-relevant passages in a document. Yet another object is to provide for optional display of only the highest relevance passages for high speed patent searching. Still another object is to provide an interface that enables rapid location in patent text of any reference numeral from the figures. Yet another object is to provide an interface that enables rapid location of passages related to figure numbers. Still another object of the present invention is to provide an interface with rapid location of patent and published application numbers inside a body of text.

The present invention is a document presentation system that enables a researcher to quickly assess relevance of a document in the context of a search project. With the present invention, the researcher can locate potentially relevant areas of a document database, and then review large numbers of documents for the presence of multiple concepts. The invention contains GUI tools that enable the researcher to first load multiple keyword groups into blocks of conceptually related keywords. As the researcher navigates from document to document, the keywords are counted, and the keyword blocks are colored according to the highest keyword occurrence in each keyword block. This enables the researcher to make a first level of relevance determination within 1-2 seconds of loading the document. If multiple colors aside from red are observed, the researcher can then inspect for passages of relevance. Only passages containing a user specified number of keywords are presented, so that the researcher does not read and page through long documents. In addition, each passage has all keywords color coded, such that all keywords from a given block are made the same color. When the researcher observes multi-colored passages, he or she can quickly inspect the passage by scanning from keyword to keyword, enabling a second level of understanding within just 5-10 more seconds. In addition, the researcher is provided with the ability to scroll the document from keyword to keyword by clicking in the keyword blocks. Particularly dense keyword areas are shown on a keyword density scrollbar enabling the researcher to jump directly to keyword dense sections of the document. In addition, the researcher can instruct the interface to automatically remove all but the most relevant passages, which are defined as those with the highest number of keyword blocks represented therein. Moreover, the document is processed to present a bill of material (BOM) table and a figures table, both of which provide document navigation.
With these navigation tools, a patent researcher can view patent images in one window and quickly locate passages in the text where reference characters reside (using the BOM table) or where figures are discussed (using the figures table). In addition, the interface presents any patent numbers or published application numbers discussed in the document, which provides quick adding of applicant cited documents to a standard backward citation search. An additional tool provides the ability to tag each document according to relevance and according to presence or absence of multiple user defined concepts.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram illustrating a document analysis system in accordance with an exemplary embodiment of the invention.

FIG. 1B is a sample of a document.

FIG. 2A is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2B is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2C is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2D is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2E is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2F is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2G is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2H is a diagram of a project file created and used by the present invention.

FIG. 2I is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2J is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 2K is an interface diagram in accordance with an exemplary embodiment of the invention.

FIG. 3A is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3B is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3C is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3D is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3E is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3F is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3G is a flow diagram illustrating a process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 3H is a block color scheme table.

FIG. 3I is a document text color scheme table.

FIG. 4 is a flow diagram illustrating another process that may be carried out in accordance with the exemplary system of FIG. 1.

FIG. 5 is a block diagram illustrating a document analysis system in accordance with another exemplary embodiment of the invention.

DETAILED DESCRIPTION

Reference will now be made in detail to the present exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings.

Referring to FIG. 1, a block diagram is shown illustrating a document analysis system 100 in accordance with an exemplary embodiment of the invention. The document analysis system 100 comprises a client device 110. The client device 110 includes a document analysis module 112, an interface module 114 and a user Input/Output (I/O) interface 118. By way of example, the client device 110 may be a computing device having a processor such as a personal computer, a phone, a mobile phone, or a personal digital assistant. The document analysis system 100 may also comprise a document provider 130 and a network 120. The document provider 130 is configured to deliver one or more documents, labeled generally as 132. By way of example, the documents 132 may be electronic files containing patent data or any type of electronic file that contains textual data. See FIG. 1B for an example of a document 132. As seen, the document 132 has multiple document classifications 135 that are further divided into a class 136 and a subclass 137. In addition, notice the body of the document is composed of multiple sections (e.g., abstract, description, claims), and that the sections are further divided into document paragraphs 138. The document 132 may also contain BOM items 267, which are also known as reference characters, patent reference numbers 260, and figure numbers 268. The document provider 130 may be a remote server running a search engine such as that provided by the United States Patent and Trademark Office (USPTO), FreePatentsOnLine, Micropatent®, Delphion®, PatentCafe®, Thomson Innovation or Google®. The document provider 130 may retrieve the data from a local repository or from one or more remote document repositories. Examples of such a document repository include patent databases including those provided by EP (European patents), WO (PCT publications), JP (Japan abstracts) and DWPI (Derwent World Patent Index for patent families).
The document provider 130 may alternatively be a cloud based bulk storage system such as Amazon Simple Storage Service. The interface module 114 is configured to receive one or more documents 132 from the document provider 130 by way of network 120. By way of example, the network may be the Internet. The interface module 114 may alternatively be configured to receive one or more documents 132 through the user I/O interface 118. In such an embodiment, the documents 132 may be stored on a portable storage device (not shown) such as a CD, DVD or solid state device and the user I/O interface 118 may include a communications interface such as a wireless interface, a CD/DVD drive or a USB drive for retrieving data from the portable storage device. The documents 132 may alternately be paper-based documents and may be provided to the interface module 114 by use of a scanner (not shown) that is configured with the I/O interface 118. The client device 110 may also include a data storage element 116. The interface module 114 is also configured to receive a set of one or more concepts from a researcher by way of the I/O interface 118. The I/O interface 118 may also include at least one input device such as a keyboard, mouse, microphone or a touch screen for receiving the concepts from the researcher. Each concept is comprised of one or more text-based keywords or sets of text-based keywords which are used by the document analysis module 112 to analyze each of the documents 132. The document analysis module 112 generates statistical data based on the user-defined concepts and the documents 132. The statistical data may be used by the researcher to quickly assess the relevancy of each document 132 to each of the user-defined concepts. The document analysis module 112 may transmit the statistical data to the interface module 114 which presents the data to the researcher by way of the I/O interface 118.
The I/O interface 118 may also include a display such as an LCD or CRT monitor configured to display a graphical user interface (GUI) for presenting information such as the statistical data to the researcher. The GUI will now be discussed in greater detail.

Referring now to FIGS. 2A, 2B, 2C, 2D, 2E, and 2F, diagrams are shown illustrating a document analysis graphical user interface (GUI) 200 in accordance with an exemplary embodiment of the invention. FIGS. 3A-3F, which illustrate an exemplary computer-implemented process 300 for performing document analysis, will also be discussed. At a first step labeled as 310, the interface module 114 will receive concept data from the researcher. The interface module 114 first generates a document analysis GUI 200 and displays the GUI 200 to the researcher by way of the display device included with user I/O interface 118. As shown in FIG. 2A, the document analysis GUI 200 includes a document relevance interface 220, a document management interface 250, and a document image window 254. The document image window 254 displays non-textual data such as images or drawings that may be associated with the currently selected document thus providing an additional means for assessing the relevance of the document. As seen in FIG. 2F, the researcher may start a research project by entering one or more concepts 272. Each concept 272 may have one or more words or word groups associated therewith. As shown in FIG. 2B, the document analysis GUI 200 includes a keyword entry interface 210. The keyword entry interface 210 comprises multiple rows of alphanumeric entry fields 212. One or more keywords 213 may be entered by a researcher into each entry field 212, wherein each keyword 213 is conceptually related such that each line represents a keyword group 214. The researcher is also provided with a user thesaurus 211 and web thesaurus 219. The user thesaurus 211 can be edited and stored in a data storage element 116, and the web thesaurus 219 may be accessed through the network 120 by the interface module 114. Five alphanumeric entry fields 212 are shown to be filled in FIG. 2B. Each concept 272 and corresponding keyword group 214 may be determined manually by the researcher or may be received from an external source.
By way of example, the concepts may be reduced to a manageable number of concepts (e.g. 4-5 concepts). Keywords 213 may then be chosen for each of the concepts and entered into one of the alphanumeric fields 212 to form the keyword group 214. After entering each of the desired concepts, the researcher may then exit the keyword entry interface 210 and proceed to analysis of a set of documents based on the user-defined concepts.

At a next step labeled as 320, the interface module 114 will receive one or more documents 132. As discussed, the interface module 114 is configured to receive the one or more documents 132 from the document provider 130 by way of network 120. The interface module 114 may be configured to allow the researcher to request a predetermined set of documents 132. By way of example, the researcher may initiate a request for a specific set of patent documents or a set of patent documents that fall within a specific category or classification. The researcher may also initiate a search of a remote document repository through a search interface window 230 (shown in FIG. 2D) provided by the document analysis GUI 200. The search may be initiated by entering a set of search parameters, such as keywords, into one or more search fields 232 located on the search interface window 230. Boolean operators, wildcards and proximity indicators may be used to link the keywords together in logic sets. The search interface window 230 may also provide a search assistance window 234 that allows the previously defined keywords 213 to be added to the set of search parameters in response to a user action (e.g. a mouse click). The search assistance window 234 thereby facilitates the loading of search parameters into the one or more search fields 232. In addition, the researcher is provided with a classification search list 290, which contains a table for documenting the search project strategy (discussed in detail later).
The researcher may pick classification codes from the classification search list 290. As discussed, the interface module 114 may alternatively be configured to receive one or more documents 132 through the user I/O interface 118. In such an embodiment, the documents 132 may be stored on a portable storage device (not shown) such as a CD, DVD or solid state device and the user I/O interface 118 may include a communications interface such as a wireless interface, a CD/DVD drive or a USB drive for retrieving data from the portable storage device. Upon receiving the one or more reference documents 132, the interface module 114 will populate a document management table 252 located on a document management interface 250 (shown in FIG. 2E) with selectable rows 253 each having information descriptive of one of the received reference documents 132. By way of example, each row may include a reference document number 255 and document title 256.

At a next step, labeled as 330, the document analysis module 112 performs analysis of the one or more reference documents 132 received by the interface module 114 relative to the user-defined concepts also received by the interface module 114. As shown in FIG. 2C, the document analysis GUI 200 includes the document relevance interface 220. The document relevance interface 220 comprises a keyword table 222 and a document text window 226. When the researcher selects (by way of a mouse click or similar navigation event) one of the rows that appear in the document management table 252, processed text 228 or corresponding text of the reference document becomes viewable in the document text window 226. Each keyword entered in alphanumeric entry fields 212 is listed in a separate row of a first column 223 of the keyword table 222. The keyword table 222 also includes a second column 224. The second column 224 displays a numeric value that represents the number of times the corresponding keyword in the first column 223 appears in the processed text 228 of the currently selected document. The keywords are arranged in keyword blocks 225, wherein each block 225 contains all of the keywords from a single keyword group 214. In addition, each keyword block 225 has a highest occurring keyword 235, which is the highest occurring keyword from the block. The keyword blocks 225 may be visually separated by bold horizontal bars, labeled generally as 229. When a document is first selected by the researcher, the document analysis module 112 will retrieve the document 132 through the interface module 114 and generate the processed text 228. The document analysis module 112 will use a block color scheme 236 to determine a color for each keyword block 225. According to the block color scheme 236, the color is determined from the highest occurring keyword 235 in each keyword block 225.
The keyword table colors are selected by the document analysis module 112 from one of a set of predetermined colors in the block color scheme, each color corresponding to a range of instances of appearances of a keyword in the document 132. See FIG. 3H for an example of a block color scheme 236. As seen, red signifies lowest occurrence, and green signifies highest occurrence. All intermediate integers receive different colors along a red-green continuum. After determining a color for each keyword block 225, the document analysis module 112 will instruct the interface module 114 at step 340 to highlight each corresponding block 225 with that color. By viewing the colored keyword blocks 225 in the keyword table 222, a researcher may then make a rapid decision regarding the potential relevance of the selected document 132 to the previously defined concepts. More specifically, the researcher can use the colored keyword blocks 225 to make an initial relevance assessment within 1-2 seconds. If multiple colors, other than red, are observed in the initial relevance assessment, the researcher may then scan the processed text 228 to locate paragraphs having multiple colors, which would correspond to multiple concepts. If multi-colored paragraphs are noticed, the researcher may then decide to read portions of the processed text 228 to make a second determination as to relevance within 5-10 seconds. Finally, a researcher may view the original document 132 in the document image window 254 to make a final determination for tagging the document 132.
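
By way of a non-limiting illustration, the counting that populates the keyword table 222 and the selection of the highest occurring keyword 235 in each keyword block 225 may be sketched as follows. The function names, the sample text, and the matching rule (case-insensitive, whole-word) are assumptions made for this sketch; the description does not prescribe a particular matching rule.

```python
import re

def count_keyword(text, keyword):
    """Count occurrences of keyword in text (assumed: case-insensitive, whole word)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def build_keyword_table(text, keyword_groups):
    """Return one block per keyword group with per-keyword counts
    and the highest occurring keyword of that block."""
    table = []
    for group in keyword_groups:
        counts = {kw: count_keyword(text, kw) for kw in group}
        highest = max(counts, key=counts.get)  # ties resolve to the first keyword listed
        table.append({"counts": counts,
                      "highest": highest,
                      "highest_count": counts[highest]})
    return table

# Hypothetical processed text and keyword groups.
sample_text = ("The valve seats against the spring. "
               "The spring biases the valve. A valve stem is shown.")
groups = [["valve", "stem"], ["spring", "coil"]]
table = build_keyword_table(sample_text, groups)
print(table[0]["highest"], table[0]["highest_count"])  # valve 3
print(table[1]["highest"], table[1]["highest_count"])  # spring 2
```

The count of the highest occurring keyword in each block is the value that a block color scheme would then translate into a color.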

In addition, the count of instances for each keyword 213 may be transformed by the document analysis module 112 into a normalized count so that the length of the selected document 132 is substantially eliminated as a variable. The computation for the normalized count involves dividing the totality of the text characters in the selected document by five (average letter count for a word in the English language) to obtain a normalized word count. Next, the count of instances for each keyword 213 is divided by the normalized word count to find density. This is followed by multiplying density by 2500 (arbitrary constant) and rounding to result in the normalized count expressed in integers. In one aspect of the exemplary embodiment, one of the keyword table colors is associated with a normalized count value of 10 or greater, another keyword table color with a value of 9, and a third keyword table color with a value of 8, and so on until the zero color is assigned. Steps 330 and 340 may be repeated for each of the received reference documents 132 as indicated by dashed arrow 350.
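
The normalized count computation just described may be sketched as follows. The function and parameter names are hypothetical, and the sketch returns a color index from 0 to 10 rather than an actual color, since the colors themselves are defined by the block color scheme 236 of FIG. 3H.

```python
def normalized_count(raw_count, total_characters, avg_word_length=5, scale=2500):
    """Normalize a keyword count so that document length is substantially removed."""
    normalized_word_count = total_characters / avg_word_length
    if normalized_word_count == 0:
        return 0  # assumption: an empty document yields a zero count
    density = raw_count / normalized_word_count
    # Note: Python's round() rounds exact halves to the nearest even integer;
    # the description only says "rounding", so this detail is an assumption.
    return round(density * scale)

def color_index(norm_count, top=10):
    """Map a normalized count to a color slot: 10 or greater shares the top slot."""
    return min(norm_count, top)

# 12 occurrences in a 50,000-character document:
# 50,000 / 5 = 10,000 words; 12 / 10,000 = 0.0012; 0.0012 * 2500 = 3.
print(normalized_count(12, 50_000))               # 3
# 30 occurrences in a 25,000-character document normalize to 15,
# which falls in the "10 or greater" slot.
print(color_index(normalized_count(30, 25_000)))  # 10
```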

As seen in FIG. 2C, a keyword density scrollbar 227 may also be provided having integrated colors which correspond to such sections of text where highlighted keywords are tightly grouped. By way of example, the scrollbar 227 may be divided vertically into density sections 238, wherein the number of sections corresponds to the number of document paragraphs 138 appearing in the processed text 228. Colors may be assigned to each density section 238 according to the number of keyword groups 214 that appear in each document paragraph 138. The researcher can then rapidly scroll through long documents directly to areas where multiple keyword groups 214 are represented.
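
One possible way to assign colors to the density sections 238 is sketched below. The input is a list giving, for each document paragraph 138, the number of keyword groups 214 represented in it. The linear interpolation between red and green and the (R, G, B) tuple representation are assumptions made for this sketch only.

```python
def density_colors(groups_per_paragraph):
    """Return one (R, G, B) color per paragraph: red for zero keyword groups,
    green for the highest number found, intermediate values in between."""
    highest = max(groups_per_paragraph, default=0)
    colors = []
    for n in groups_per_paragraph:
        fraction = n / highest if highest else 0.0
        red = round(255 * (1 - fraction))
        green = round(255 * fraction)
        colors.append((red, green, 0))
    return colors

# Four paragraphs containing 0, 1, 2 and 0 keyword groups respectively.
colors = density_colors([0, 1, 2, 0])
print(colors[0])  # (255, 0, 0): no keyword groups, shown in red
print(colors[2])  # (0, 255, 0): highest number of groups, shown in green
```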

As discussed, when the researcher selects one of the rows that appear in the document management table 252, the processed text 228 of the corresponding reference document 132 becomes viewable in the document text window 226 and an image of the document 132 becomes viewable in the document image window 254. In addition, the document analysis module 112 will assign a unique keyword color to each block of keywords (each block of keywords corresponding to one concept) for subsequent highlighting in the document text window 226. Thereby, each keyword within a keyword block 225 or logical set of keywords will have the same unique color. The document analysis module 112 then instructs the interface module 114 at step 340 to display the keywords highlighted with the corresponding unique keyword colors in the document text window 226. In this manner, a scrolling scan of the displayed text may reveal sections of text where highlighted keywords are tightly grouped together. When keywords highlighted with different colors appear within a section, such a localized array might indicate a confluence of concepts and a nexus of context. The need for reading can be reduced by the collage of highlighted words in the localized array, the collage potentially communicating the meaning of a passage in the same way that a word with missing letters is recognizable. Thus a quick confirmation of relevance can be made by a person in a glancing inspection.

With reference now to FIGS. 3B-3G, the generation of the document relevance interface 220 will be discussed in greater detail. As seen in FIG. 3B, four basic inputs are the document 132, the keyword groups 214, the static parameters 240, and the interface settings 245. With these inputs, the document analysis module 112 runs processes 600, 630, 640 to generate the document relevance interface 220.

Process 600 Generate Processed Text 228

Referring to FIG. 2C and FIG. 3C, process 600 begins when the researcher navigates to a document 132 using the document management interface 250. At step 601, if section selector 218 is “Bill of Material” (or “BOM”), then proceed to step 602, where the description field of the document is selected and passed to step 650. Here the “Build BOM” subroutine is executed and the resulting text becomes the processed text 228, which is displayed in the document text window 226. The result is a single column of the reference characters followed by item names. Returning to step 601 and proceeding to step 604, if section selector 218 is “Class”, then proceed to step 605, and select all document classifications 135. Next, at step 606, retrieve full class schedules for each document classification 135, which become the processed text 228 and are displayed in the document text window 226. Returning to step 604 and proceeding to step 607, if section selector 218 is “Citations”, then proceed to step 608. Select the Citations section of the document 132, and proceed to step 609. Select the description section of the document, and proceed to process 640. Append the examiner citations from the citations section to the patent and application numbers found in process 640. The resulting delimited list of patent reference numbers becomes the processed text 228, which is displayed in the document text window 226. Returning to step 607 and proceeding to step 611, if Summary Only or SO=Yes, then proceed to step 612 and remove all text related to prior art and background by searching for words such as SUMMARY or BRIEF SUMMARY. Proceeding to step 613, first select the document section (i.e., if section selector 218 is “Claims”, select the claims section). Next, separate the selected section into an array of paragraphs using carriage returns as the delimiter to make 1d-array 670. Next, count the total number of occurrences of any keyword from each keyword group 214 in each paragraph in 1d-array 670, and store as 2d-array 671.
Next, use the 2d-array 671 to find the number of different keyword groups 214 represented in each paragraph (i.e., the number of non-zero cells in each row of 2d-array 671), and store as 1d-array 672. Next, at step 614, if Keyword Setting or KW Setting 215=KW1, then proceed to step 615, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 of zero (so that the end display shows only paragraphs with at least one keyword group 214 represented). Returning to step 614, and on to step 616, if KW Setting 215=KWII, then proceed to step 617, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 of zero or one (so that the end display shows only paragraphs with at least two keyword groups 214 represented). Returning to step 616, and on to step 618, if KW Setting 215=KW Hot, then proceed to step 619, and remove all paragraphs from the 1d-array 670 having a corresponding number in 1d-array 672 that is less than the highest number found anywhere in 1d-array 672 (so that the end display shows only paragraphs with the highest number of keyword groups 214 represented). Next, assign colors to each density section 238 of the keyword density scrollbar 227 using the 1d-array 672 and a color scheme of 1) green=highest number in 1d-array 672, 2) red=0, 3) all intermediate numbers receive an intermediate color along the red-green spectrum. Moving now to step 620, assign unique colors to each keyword group 214 using a document text color scheme 237, wherein each color is picked for its ability to stand out on a white background and also to contrast with the other colors. See an example of the document text color scheme in FIG. 3I. 
At step 621, if highlight setting 217=All, then proceed to step 622 and convert 1d-array 670 to regular text, and highlight all keywords according to the color scheme developed in step 620. Returning to step 621, and on to step 623, convert 1d-array 670 to regular text, and highlight the keywords in the visible window according to the color scheme developed in step 620. Display as the processed text 228 in the document text window 226 of the document relevance interface 220.
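
By way of illustration only, the paragraph counting and filtering of steps 613 through 619, together with the coloring of the density sections 238, might be sketched in Python as follows. The function names, the whole-word matching of single-word keywords, the splitting on both carriage returns and line feeds, and the linear RGB blend between red and green are all assumptions of this sketch and are not taken from the disclosure.

```python
import re


def split_paragraphs(section_text):
    """1d-array 670: paragraphs of the selected section (assumed split on CR/LF)."""
    return [p for p in re.split(r"[\r\n]+", section_text) if p.strip()]


def count_groups(paragraphs, keyword_groups):
    """Return (2d-array 671, 1d-array 672) for the given paragraphs.

    Row i of the 2d result holds, per keyword group, the total number of
    keyword occurrences in paragraph i. The 1d result holds the number of
    non-zero cells in each row, i.e. how many groups are represented.
    """
    counts_2d = []
    for paragraph in paragraphs:
        words = re.findall(r"[a-z0-9]+", paragraph.lower())
        row = [sum(words.count(kw.lower()) for kw in group)
               for group in keyword_groups]
        counts_2d.append(row)
    groups_1d = [sum(1 for cell in row if cell > 0) for row in counts_2d]
    return counts_2d, groups_1d


def filter_paragraphs(paragraphs, groups_1d, kw_setting):
    """Keep only paragraphs meeting the KW Setting 215 threshold."""
    if kw_setting == "KW1":
        minimum = 1                          # at least one group represented
    elif kw_setting == "KWII":
        minimum = 2                          # at least two groups represented
    elif kw_setting == "KW Hot":
        minimum = max(groups_1d, default=0)  # only the highest-scoring paragraphs
    else:
        minimum = 0                          # any other setting: keep everything
    return [p for p, n in zip(paragraphs, groups_1d) if n >= minimum]


def density_color(n, highest):
    """(r, g, b) for a density section: red at 0, green at the highest count."""
    if highest == 0:
        return (255, 0, 0)
    fraction = n / highest
    return (round(255 * (1 - fraction)), round(255 * fraction), 0)
```

In this sketch the threshold for "KW Hot" is recomputed from 1d-array 672 for each document, so a document whose best paragraph represents only one keyword group would still display that paragraph.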

Process 630 Generate Keyword Table 222:

Referring to FIG. 3D, first at step 631, count the total number of occurrences of each keyword 213 and store in 1d-array 673. Isolate the highest number representing each keyword group 214 from 1d-array 673, and store to 1d-array 674. Next, at step 632, assign colors to 1d-array 674 according to the block color scheme 236 from FIG. 3H (i.e., red=0, yellow=5, green=10 or more, all intermediate numbers between 0 and 10 get a different color along a red-green continuum). Arrange the keyword groups 214 for display in the first column 223, and 1d-array 673 in the second column. Separate each keyword group 214 with a horizontal bar 229 to form multiple keyword blocks 225. Index processed text 228 against keywords 213, such that mouse clicks in any row will cause scrolling to keyword locations in text 228. As seen in FIG. 2I, the index provides the researcher with rapid scrolling to and bolding of the keyword that is clicked in the keyword table.
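
As a non-limiting sketch, steps 631 and 632 might be expressed in Python as shown below. The names `keyword_counts`, `block_color`, and `build_keyword_table`, the nested-list layout (one list per keyword group rather than a flat 1d-array 673), and the piecewise-linear RGB interpolation through yellow are assumptions made for illustration; the disclosure specifies only the red=0, yellow=5, green=10 or more anchor points.

```python
import re


def keyword_counts(text, keyword_groups):
    """Count whole-word occurrences of each keyword, grouped by keyword group."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [[words.count(kw.lower()) for kw in group] for group in keyword_groups]


def block_color(highest):
    """(r, g, b) for a keyword block: red at 0, yellow at 5, green at 10 or more."""
    n = min(max(highest, 0), 10)          # counts above 10 are treated as 10
    if n <= 5:
        return (255, round(255 * n / 5), 0)        # red toward yellow
    return (round(255 * (10 - n) / 5), 255, 0)     # yellow toward green


def build_keyword_table(text, keyword_groups):
    """One block per keyword group: (keyword, count) rows plus the block color."""
    blocks = []
    for group, counts in zip(keyword_groups, keyword_counts(text, keyword_groups)):
        highest = max(counts, default=0)  # highest count within the group
        blocks.append({"rows": list(zip(group, counts)),
                       "color": block_color(highest)})
    return blocks
```

Each returned block corresponds to one keyword block 225, with its rows supplying the first and second columns of the keyword table.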

Process 640 Generate Patent References 260:

Referring to FIG. 3E, first at step 641, select the Description from the document 132, and convert all characters to lower case. Remove all non-alphabet and non-numeric characters such as slashes, commas, periods, etc. Next, at step 642, hunt for any words preceded by phrases such as “patent”, “us”, “u.s.”, “no.” If the words are numeric, then add them to a 1d-array 675 of patent reference numbers 260. Next, hunt for any words that are 6, 7, or 11 characters long and are composed entirely of numeric characters, and add them to 1d-array 675.
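
A minimal sketch of steps 641 and 642 in Python follows. It assumes that whitespace is preserved when punctuation is stripped (so that words remain separable), that the two hunts are merged into a single pass in order of appearance, and that duplicate numbers are recorded once; the function name and the cue-word set are hypothetical.

```python
import re

# After punctuation is stripped, "u.s." collapses to "us" and "no." to "no".
CUE_WORDS = {"patent", "us", "no"}


def extract_patent_references(description):
    """Return candidate patent reference numbers found in the description."""
    text = description.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop slashes, commas, periods, etc.
    words = text.split()
    references = []
    for i, word in enumerate(words):
        if not word.isdigit():
            continue
        after_cue = i > 0 and words[i - 1] in CUE_WORDS
        plausible_length = len(word) in (6, 7, 11)
        if (after_cue or plausible_length) and word not in references:
            references.append(word)
    return references
```

Because commas and slashes are removed before matching, a number written as 7,123,456 or 2010/0123456 is seen as a single 7- or 11-character numeric word.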

Process 650 Generate BOM Table 262

The BOM table will contain BOM items 267, which are also known as reference characters, and are found throughout patent text as seen in FIG. 1B. Referring to FIG. 3F, first at step 651, select the description from document 132. Next, at step 652, search for words that start with numbers and load them to a BOM Candidate Array 676. Next, search for words that start with a left parenthesis and are immediately followed by a number, and add them to the BOM Candidate Array 676. Next, at step 653, retrieve the three words previous to each element in the BOM Candidate Array 676. Eliminate candidates where the preceding words contain words such as fig, figure, or figs. Next, eliminate candidates that are not immediately succeeded by a space, right parenthesis, period, or comma. Index with processed text 228. Next, at step 654, load the remaining candidate numbers into the BOM Table 262. Index BOM candidate array 676 with processed text 228, such that mouse clicks in any row will cause scrolling to BOM item locations in processed text 228. As seen in FIG. 2K, the index provides the researcher with rapid scrolling to and bolding of the BOM item 267 that is clicked in the BOM table.
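
The candidate search and elimination of steps 652 through 654 might be sketched in Python as follows. This is an illustrative reading only: the function name, the treatment of a trailing letter as part of the reference character (e.g. 12a), the acceptance of a candidate at the very end of the text, and the use of a character offset as the index into the text are assumptions not stated in the disclosure.

```python
import re

FIGURE_WORDS = {"fig", "figure", "figs"}


def find_bom_items(description):
    """Return (reference character, character offset) pairs for BOM candidates."""
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", description)]
    items = []
    for i, (word, start) in enumerate(words):
        # Candidate: a word starting with a number, optionally after "(".
        match = re.match(r"\(?([0-9][0-9A-Za-z]*)", word)
        if not match:
            continue
        # Eliminate candidates whose three preceding words mention a figure.
        previous = [w.lower().strip(".,;:()") for w, _ in words[max(0, i - 3):i]]
        if any(w in FIGURE_WORDS for w in previous):
            continue
        # Keep only candidates followed by a space, ")", "." or ",".
        rest = word[match.end():]
        if rest and rest[0] not in ").,":
            continue
        items.append((match.group(1), start))
    return items
```

The stored offsets stand in for the index against processed text 228, so that a click on a table row could scroll the text window to the corresponding location.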

Process 660 Generate Figs Table 261

The Figs table will contain figure numbers 268, which are found throughout patent text as seen in FIG. 1B. Referring to FIG. 3G, first at step 661, select the description of document 132. Next, at step 662, search for words immediately preceded by words such as fig, fig., figure, figs., figs, and add them to a figs candidate array 677. Next, at step 663, remove elements from the figs candidate array 677 that do not start with a number (i.e., allow 1, 2, 2C, 2D). Next, at step 664, index with processed text 228, and load figs candidate array 677 and the associated index (for quick mouse scrolling) into figs table 261. As seen in FIG. 2J, the index provides the researcher with rapid scrolling to and bolding of the figure number 268 that is clicked in the figs table.
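
Steps 662 through 664 might be sketched in Python along the following lines. The function name, the case-insensitive comparison, the stripping of surrounding punctuation from each candidate, and the use of a character offset as the index are assumptions of this sketch.

```python
import re

FIGURE_WORDS = {"fig", "fig.", "figure", "figs.", "figs"}


def find_figure_numbers(description):
    """Return (figure number, character offset) pairs for the figs table."""
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", description)]
    table = []
    for (previous, _), (word, start) in zip(words, words[1:]):
        # Candidate: any word immediately preceded by a figure word.
        if previous.lower().lstrip("(") not in FIGURE_WORDS:
            continue
        label = word.strip(".,;:()")
        # Keep only candidates that start with a number (1, 2, 2C, 2D, ...).
        if label[:1].isdigit():
            table.append((label, start))
    return table
```

A phrase such as "the figure shows" is discarded at the second test because the word following the figure word does not start with a number.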

Referring now to FIG. 4, another exemplary method 400 for performing document analysis will now be discussed. Steps 410 through 440 proceed in a similar manner to steps 310 through 340 of the computer-implemented process 300. The present embodiment additionally provides a step 450 for receiving and storing data from the researcher that indicates the determined relevancy of the currently selected document 132 to the one or more user-defined concepts. As discussed, the interface module 114 will populate the document management table 252 (shown in FIG. 2E) with selectable rows 253 each having information descriptive of one of the received reference documents. In the exemplary embodiment, the document management table 252 also includes one or more additional columns for allowing the researcher to indicate (by way of a mouse-click or similar navigation event) the relevance of the currently selected document. Each row of the document management table 252 may have a relevancy value column 257 that contains an input field for indicating the overall relevancy of the associated reference document. By way of example, the interface module 114 may provide the researcher with the ability to select indicia (e.g. using a drop-down menu list) such as “A” for highest relevance, “B” for suspected relevance, and “C” for uncertain relevance. Irrelevant documents may be marked with an “I” to place a marker in the file indicating that a reference document was reviewed. Each row of the document management table 252 may also have one or more additional columns labeled generally as 258 that contain an input field for indicating whether a specific concept has been verified to appear in the currently selected reference document. The interface module 114 may provide the researcher with the ability to toggle a field (one such field is labeled as 259) corresponding to a specific concept “on” or “off” (e.g. by a mouse-click) when indicating whether a particular concept does or does not exist. 
A column may be provided for each of the previously discussed concepts. However, in another embodiment the interface module 114 may provide the researcher with a concept management window 270 (see FIG. 2F) for allowing the researcher to define different concepts 272 from which the additional columns 258 may be derived. In this manner, the researcher may be able to track higher-level or more abstract concepts than were initially defined and may also provide more user-friendly naming of the concepts (useful, for example, for report generation). The interface module 114 may also store the previously discussed relevancy indicators in a data repository such as the database labeled as 116 in FIG. 1. By storing each of the indicators, the interface module 114 is able to generate reports that may include a reduced, and more relevant, set of reference documents 132 than was initially received by the client device 110. Steps 430 through 450 may be repeated for each of the received reference documents 132 as indicated by dashed arrow 460.

In this manner, a document analysis system is provided that includes a computing device having program modules executable by a processor, the program modules configured to rapidly transform a first set of data files representative of a plurality of reference documents into a second set of data files representative of a subset of the plurality of reference documents, the subset having textual content particularly relevant to one or more received concepts.

Referring to FIG. 5, a block diagram is shown illustrating a document analysis system 500 in accordance with another exemplary embodiment of the invention. The document analysis system 500 is similar to the document analysis system 100 of FIG. 1 but provides a client-server architecture. Accordingly, document analysis system 500 includes a client device 510 and a server device 580. The server device 580 may be a computing device having a processor, such as a personal computer, or may be implemented on a high performance server, such as an HP, IBM or Sun computer using an operating system such as, but not limited to, Solaris or UNIX. The server device 580 includes a document analysis module 512 similar in function to the document analysis module 112 of the embodiment of FIG. 1.

Thus, a document analysis system having the benefits of allowing for rapid and accurate assessment of the relevancy of a document or set of documents to one or more concepts is contemplated. The document analysis system receives one or more concepts along with one or more reference documents and generates various sensory indicators that assist a researcher in assessing the relevance of each of the received documents to the received concepts. In one aspect, the document analysis system displays a table of keywords separated into blocks, each block of keywords corresponding to one of the concepts. The document analysis module will highlight each block of keywords with a color, the color based on the highest count of a keyword within each group of keywords. The color of a block thus indicates the relative presence of a concept in the document. In another aspect, the document analysis system determines a unique color for each block of keywords and then displays the text of the reference document with each occurrence of a keyword highlighted with the color of its associated keyword block. In this manner a researcher can quickly identify passages that contain multiple concepts.

1. A document analysis and search system for searching through a plurality of documents and for analyzing documents located as a result of a conducted search, the search being directed to a predetermined subject matter, the system comprising: a program module storable on a client device that includes at least one of a computer readable medium and a memory, the client device being positioned in communication with a network, and the network being in communication with a document provider database and a thesaurus database, the program module comprising instructions executable by a processor of the client device to locate at least one document from among the plurality of documents, the program module comprising an interface module; and a document analysis module; wherein the interface module receives concept data relating to the subject matter of the search, the concept data including at least one concept, the concept including a plurality of keywords used to conduct the search; wherein the interface module receives a plurality of documents relating to the concept data from the document provider database; wherein the interface module generates and displays a document analysis graphical user interface, the document analysis graphical user interface comprising a keyword entry interface, a document relevancy interface, a document management interface, and a document image window, wherein the document analysis module generates statistical data based on the at least one concept, and wherein the statistical data is used to assess relevancy of each of the documents located in the search so that each of the documents can be displayed using the document relevancy interface; wherein the document analysis module transmits the statistical data to the interface module to be displayed; wherein the keyword entry interface allows entry of one or more keyword groups, and wherein each keyword group includes a plurality of keywords that are conceptually related to one another.
2. A system according to claim 1 wherein the statistical data includes a count of a number of instances that each of the keywords appears in a document located in the search.
3. A system according to claim 1 wherein the interface module allows for each of the documents located in the search to be manually assigned a relevancy value.
4. A system according to claim 1 wherein the document relevancy interface includes a keyword table and a document text window; and wherein corresponding text relating to at least one of the documents is displayed in the document text window.
5. A system according to claim 4 wherein the keyword table includes a first column to display the keywords used to conduct the search, and a second column to display a numeric value relating to the number of times each keyword appears in each of the documents located in the search.
6. A system according to claim 5 wherein the keywords in the keyword table are arranged in keyword blocks; and wherein each keyword block includes a keyword group.
7. A system according to claim 6 wherein the keyword table is color coded according to a block color scheme; and wherein the block color scheme assigns a similar color to each keyword appearing in a keyword block.
8. A system according to claim 7 wherein the similar color is assigned according to the highest occurring keyword from the block.
9. A system according to claim 8 wherein the block color scheme is a continuum between a first predetermined color and a second predetermined color, wherein the first predetermined color signifies zero occurrence of the highest occurring keyword from the block, and wherein the second predetermined color signifies a high occurrence of the highest occurring keyword from the block.
10. A system according to claim 9 wherein a document is considered relevant when colors other than the first predetermined color are displayed in the keyword table, and wherein a document is considered irrelevant when the first predetermined color is dominant in the keyword table.
11. A system according to claim 6 wherein the keywords occurring in the corresponding text displayed in the document text window are color coded according to a document text color scheme; wherein the document text color scheme assigns a similar color to each keyword from a keyword group; wherein each keyword group is assigned a different color; and wherein the different colors are chosen based upon ability to contrast against each other and against a white background of the document text window.
12. A system according to claim 11 wherein a document paragraph is considered relevant when multiple colors are displayed.
13. A system according to claim 12 wherein the document text window displays document paragraphs having a predetermined minimum number of different colors displayed therein; and wherein the predetermined minimum number of different colors is controlled by a keyword setting on the document relevance interface.
14. A system according to claim 12 wherein the document text window displays document paragraphs having a predetermined maximum number of different colors displayed therein; and wherein the predetermined maximum number of different colors is determined by counting the number of keyword groups represented in each paragraph and isolating the highest number of keyword groups represented in each paragraph.
15. A system according to claim 5 wherein the document relevancy interface further comprises a figs table; wherein the figs table displays one or more figure numbers located in the corresponding text of the document; and wherein the location of the figure numbers in the corresponding text is indexed.
16. A system according to claim 5 wherein the document relevancy interface further comprises a BOM table; wherein the BOM table displays one or more reference characters located in the corresponding text; and wherein the location of the reference character in the corresponding text is indexed.
17. A system according to claim 1 wherein the interface module is adapted to store the documents located as a result of the search on a portable storage device.
18. A system according to claim 1 further comprising a document search history display to display a historical record of the search.
19. A system according to claim 1 wherein the document graphical user interface further comprises a concept management window to allow for different concepts to be defined while the search is being conducted.
20. A system according to claim 1 wherein the document relevancy interface includes a keyword density scrollbar; wherein the keyword density scroll bar has a vertical section for each paragraph in the corresponding text; and wherein each vertical section is color coded according to the number of keyword groups represented in the paragraph.